removed the bibliography section and it knitted for me so we’ll need to add that to github

Checkpoint 4 information:

Ellen completed the implementation of the Pseudo-BMA on the candy dataset, and created most of the visualizations shown below. Esther wrote the Bayesian Model Averaging section, did pp_checks() on the models and analyzed their stability, while I wrote the section on AIC / BIC with Ellen’s help, and merged all work together, wrote up checkpoint 4, and added all comments.

**Next steps*: Make sure all the writing flows together, add more comments to code, perhaps add a conclusion, and pretty it all up a bit!

Introduction to Common Model Selection Critera

Model specification is an element of statistics whose importance is often glossed over. In most cases, there are multiple appropriate models for a set of data. The end choice of model can hugely impact the results, and classical methods offer limited guidance on the best process for accounting for the uncertainty this creates. Unstructured searches and checks for the best model specification can lead to incorrect inferences, fragile reported findings, and publication bias [@Montgomery]. Frequentist approaches to model selection include but are not limited to general lasso models, using the Akaike Information Criterion (AIC), and using the root mean squared error (RMSE) or R^2. We will thus elaborate a little on AIC and its related counterparts, BIC and WAIC.

AIC and Some Frequentist Approaches

AIC is most simply thought of as a measure of model ‘balance’, weighing a model’s ability to fit the given data while also considering the possibility for overfitting. The lower AIC score, the better. The AIC equation is given as:

\[AIC = -2 ln(L) + 2k\]

Where \(L\) is the model’s maximum log-likehood estimate, and \(k\) is the number of parameters in the model. While a high log-likelihoods help to decrease AIC, generally these are achieved through more parameters. Of course, the AIC value of a model is only relevant when compared to the AIC for other models. Given multiple models \(i\), you can estimate the probability a model \(i\) minimizes information loss as such, where \(AIC_{min}\) is the lowest score in the set of models.

\[P_i = e^{(AIC_{min} - AIC_i)/2)}\]

However, AIC does not necessarily translate well to the Bayesian realm - given that priors can be placed on parameters and on models, a Bayesian model could have a high log-likelihood but a low probability.

Another similar method is the Bayesian Information Criterion, given by:

\[BIC = k\text{ }ln(n) - 2ln(L)\] Where \(k\) is the number of parameters in the model, \(n\) is the number of data points, and \(L\) is again the likelihood function. BIC penalizes free parameters more than the AIC, and while AIC tries to select the best model that describes the data presented, the BIC attempts to select the true model from among a model set. https://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other. However, like AIC, BIC doesn’t naturally fit with the Bayesian framework as it is scored by the likelihood function.

One information criterion that fits naturally within the Bayesian framework is the Widely Applicable Information Criterion (WAIC).

\[ WAIC = 2\sum_{i=1}^nlog\left(\frac{1}{S}\sum_{s=1}^S p(y_i\vert\theta^s) \right) - 2\sum_{i=1}^n V_{s=1}^S(log(p(y_i\vert\theta^s))) \]

Where \(S\) is the number of posterior draws, \(\theta^s\) is the s-th posterior draw for parameters \(theta\), and \(V^S_{s=1}\) is the sample variance

The first half of WAIC uses the posterior distribution of \(\theta\), as opposed to only likelihood estimates for theta as are used in AIC and BIC. This means that the WAIC is fully bayesian.

Both AIC and WAIC can be viewed as ways of estimating how well a model will predict future data.

Bayesian Model Averaging

Bayesian Model Averaging (BMA) offers an alternative practice that helps ensure findings are robust to a variety of model specifications. At its simplest level, BMA assigns priors to potential model specifications and then caluclates posterior distributions for the model itself, in addition to the coefficients within the specification. This is thus an extension of previous Bayesian theory that focuses solely on coefficient estimation.

Let us begin by considering a matrix \(X\) of all the \(n \times p\) potential independent variables to predict a response variable \(Y\). A standard linear analysis would assume \(Y = X \beta + \epsilon\), where \(\beta\) is a coefficient matrix and \(\epsilon\) ~ \(N(0, \sigma^2)\). There are \(q=2^p\) potential model specifications from the model space \(\{M_1, M_2, ...M_q\}\), and in many cases, there is ambiguity about which of these is best. BMA incorporates this uncertainty into the process rather than ignoring it and claiming that the final model is the only option. This leads to greater flexibility in the inferences of the end results [@Montgomery].

Each model \(M_k\) encompasses the likelihood function \(L(Y|\beta_k, M_k)\) of the observed data \(Y\) in terms of a model-specific coefficient matrix \(\beta_k\) with a prior \(\pi(\beta_k|M_k)\). Both the likelihood and priors are conditional on a particular model [@Fragoso]. The posterior distribution for the model parameters is then \[\pi(\beta_k|Y, M_k) = \frac{L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k)}{\int L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k) \; d\beta_k}\]

The above denominator represents the marginal distribution of the observations over all paramteter values specified in model \(M_k\). It is called the model’s marginal likelihood or model evidence and is denoted by \(\pi(Y|M_k)\). BMA now assumes a prior distribution over the model space describing the prior uncertainty over each model’s capability to accurately describe and/or predict the data. This is modeled as a probability density over the model space, with values \(\pi(M_k)\) for \(k = 1, 2, ... q\) [@Fragoso].

Then, the posterior probability of model specification \(M_k\) is \[\pi(M_k | Y) = \frac{L(M_k | Y) \; \pi(M_k)}{\sum_{k=0}^q \; \pi(Y|M_k)\; \pi(M_k)}\].

Pseudo-Bayes?

However, Yao, Vehtari indicate that traditional BMA is extremely sensitive to model priors. As a result, Pseudo-BMA is built off of a different method for model selection, known as Leave-One-Out Cross Validation (LOO) (but is more stable). Both AIC and WAIC approach LOO as sample sizes increase. A single data point \(d_{out}\) is excluded from the dataset while a model \(i\) is trained on the rest. The model is then used to predict on \(d_{out}\), and the residual is calculated. This is repeated muliple times and the error metric is averaged for each model.

While exact LOO requires \(n\) iterations, one for each point in \(y\), Pseudo-BMA reduces computational complexity by taking samples, \(S\) from the posterior distribution. \(w_{i,k}^s\) represents the weights calculated by Pareto Smoothed Importance Sampling (PSIS). Using the PSIS weights with a LOO-styled procedure is coined PSIS-LOO:

(/Users/joshupadhyay/Documents/yaovhetari.jpg) image link probably broken, will fix!

Given a dataset \(y\), models \(M_k\), weights \(w_{i,k}^s\), and parameters\(\Theta\), PSIS-LOO is an efficent way to approximate the log pointwise predictive distributions \(log \text{ }\hat{p} (y_i | y_{-i}, M_K)\).The predictive distributions are then summed over all \(n\) data points in the dataset to get the estimated expected log pointwise predictive density (\(\widehat{elpd}^k\)), for a specific model \(k\).

This is similar to model probabilities using AIC, model probabilities are calculated by

\[ w_k = \frac{exp(\widehat{elpd}^k)}{\sum_{k=1}^Kexp(\widehat{elpd}^k)} \]

In practice, one more step involving bootstrapping is used to deal with additional bias in the estimate. This final model weight \(w_k\) is the approximate probability of model k being to true model.

These model probabilities can be used in a variety of ways. For example, BMA allows for a direct combination of models to calculate combined estimations for parameters, which leads to lower risk predictions than a single model [@Fragoso]. In addition, BMA can be used in model selection by choosing the model with the highest posterior probability. We will focus on this latter application.

Implementation

library(ggplot2)
library(dplyr)
library(rstan)
library(rstanarm)
library(bayesplot)
library(loo)
library(tidyr)
library(ggridges)
library(purrr)

To illustrate Pseudo-BMA in practice, we’ve chosen the famous candy dataset from the fivethirtyeight package:

candy <- fivethirtyeight::candy_rankings
dim(candy)
## [1] 85 13
names(candy)
##  [1] "competitorname"   "chocolate"        "fruity"           "caramel"         
##  [5] "peanutyalmondy"   "nougat"           "crispedricewafer" "hard"            
##  [9] "bar"              "pluribus"         "sugarpercent"     "pricepercent"    
## [13] "winpercent"
head(candy)
## # A tibble: 6 x 13
##   competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer
##   <chr>          <lgl>     <lgl>  <lgl>   <lgl>          <lgl>  <lgl>           
## 1 100 Grand      TRUE      FALSE  TRUE    FALSE          FALSE  TRUE            
## 2 3 Musketeers   TRUE      FALSE  FALSE   FALSE          TRUE   FALSE           
## 3 One dime       FALSE     FALSE  FALSE   FALSE          FALSE  FALSE           
## 4 One quarter    FALSE     FALSE  FALSE   FALSE          FALSE  FALSE           
## 5 Air Heads      FALSE     TRUE   FALSE   FALSE          FALSE  FALSE           
## 6 Almond Joy     TRUE      FALSE  FALSE   TRUE           FALSE  FALSE           
## # … with 6 more variables: hard <lgl>, bar <lgl>, pluribus <lgl>,
## #   sugarpercent <dbl>, pricepercent <dbl>, winpercent <dbl>
summary(candy)
##  competitorname     chocolate         fruity         caramel       
##  Length:85          Mode :logical   Mode :logical   Mode :logical  
##  Class :character   FALSE:48        FALSE:47        FALSE:71       
##  Mode  :character   TRUE :37        TRUE :38        TRUE :14       
##                                                                    
##                                                                    
##                                                                    
##  peanutyalmondy    nougat        crispedricewafer    hard        
##  Mode :logical   Mode :logical   Mode :logical    Mode :logical  
##  FALSE:71        FALSE:78        FALSE:78         FALSE:70       
##  TRUE :14        TRUE :7         TRUE :7          TRUE :15       
##                                                                  
##                                                                  
##                                                                  
##     bar           pluribus        sugarpercent     pricepercent   
##  Mode :logical   Mode :logical   Min.   :0.0110   Min.   :0.0110  
##  FALSE:64        FALSE:41        1st Qu.:0.2200   1st Qu.:0.2550  
##  TRUE :21        TRUE :44        Median :0.4650   Median :0.4650  
##                                  Mean   :0.4786   Mean   :0.4689  
##                                  3rd Qu.:0.7320   3rd Qu.:0.6510  
##                                  Max.   :0.9880   Max.   :0.9760  
##    winpercent   
##  Min.   :22.45  
##  1st Qu.:39.14  
##  Median :47.83  
##  Mean   :50.32  
##  3rd Qu.:59.86  
##  Max.   :84.18

Exploratory EDA

Plotting the top 10 candies by win percentage

candy %>%
  head(20) %>% 
  ggplot() +
  geom_col(aes(y = reorder(competitorname, winpercent), x = winpercent)) +
  theme_minimal() +
  labs(x = "Percent of Head to Heads Won", y = NULL, title = "Top 20 Candies") 

Plot how many candies are each type

candy %>% 
  summarise_if(is.logical, sum) %>% 
  pivot_longer(1:ncol(.), names_to = "type", values_to = "count") %>%
  mutate(type_clean = c("Chocolate", "Fruity", "Caramel", "Peanuty or Almondy", "Nougat", "Crisped Rice Wafer", "Hard", "Bar", "Several Candies in One Bag")) %>% 
  ggplot(aes(y = reorder(type_clean, count), x = count)) +
  geom_col() +
  theme_minimal() +
  labs(y = NULL, x = NULL, title = "Number of Candies with each Attribute")

Plot win percent by cost

candy %>% 
  ggplot(aes(x = pricepercent, y = winpercent)) +
  geom_jitter() +
  theme_minimal() +
  labs(x = "Relative Price", y = "Percent of Head to Heads Won", title = "Price versus Win Rate")

Plot win percent by sugar

candy %>% 
  ggplot(aes(x = sugarpercent, y = winpercent)) +
  geom_jitter() +
  theme_minimal() +
  labs(x = "Sugar Content (Percent)", y = "Percent of Head to Heads Won", title = "Sugar Content versus Win Rate") +
  ylim(0, 100)

Win percent for each candy type

candy %>% 
  select(winpercent, 2:10) %>%
  pivot_longer(2:10, names_to = "type", values_to = "is_type") %>% 
  filter(is_type) %>% 
  select(-is_type) %>% 
  group_by(type) %>% 
  mutate(mean_win = mean(winpercent)) %>% 
  ungroup() %>% 
  ggplot(aes(x = winpercent, y = reorder(type, mean_win), height = stat(density))) + 
  geom_density_ridges(stat = "binline", bins = 20, scale = 0.95, draw_baseline = FALSE) +
  theme_minimal() +
  labs(x = "Percent of Head to Heads Won", y = NULL, title = "Candy Type by Win Percent") + 
  scale_y_discrete(labels = rev(c("Crisped Rice Wafer", "Peanuty or Almondy", "Bar", "Chocolate", "Nougat", "Caramel", "Several Candies in One Bag", "Fruity", "Hard")))

We start by creating regular stan models. Each model differs by the number of variables considered, in this case. These posterior distributions are approximated via MCMC.

set.seed(454)
model_1 <- stan_glm(winpercent ~ sugarpercent, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_2 <- stan_glm(winpercent ~ sugarpercent + pricepercent, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_3 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_4 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent, 
                data = candy,
                family = gaussian, 
                chains = 4, iter = 2*5000, refresh = FALSE)
model_list <- list(model_1 = model_1, model_2 = model_2, model_3 = model_3, model_4 = model_4)

Lets examine the chains

mcmc_trace(model_1)

mcmc_dens_overlay(model_1)

mcmc_trace(model_2)

mcmc_dens_overlay(model_2)

mcmc_trace(model_3)

mcmc_dens_overlay(model_3)

mcmc_trace(model_4)

mcmc_dens_overlay(model_4)

Examining the stability of our models: Ssimulations are somewhat but not super stable. There are a few noticeable outliers in every model, as well as a relatively wide range of packed simulations. They seem to roughly fit, though go a little high overall.

pp_check(model_1)

pp_check(model_2)

pp_check(model_3)

pp_check(model_4)

mcmc_trace(model_1)

mcmc_trace(model_2)

mcmc_trace(model_3)

mcmc_trace(model_4)

As we are satsified, we can proceed with these models. Here are the coefficients:

model_1$coefficients
##  (Intercept) sugarpercent 
##     44.67007     11.81196
model_2$coefficients
##  (Intercept) sugarpercent pricepercent 
##    39.802262     6.722496    15.580509
model_3$coefficients
##          (Intercept)        chocolateTRUE           fruityTRUE 
##           35.2393393           19.6522822           10.0194746 
##          caramelTRUE   peanutyalmondyTRUE           nougatTRUE 
##            3.3625741            9.9863058            2.3087578 
## crispedricewaferTRUE             hardTRUE              barTRUE 
##            8.8676580           -4.8126445           -0.5306398 
##         pluribusTRUE 
##           -0.1460981
model_4$coefficients
##          (Intercept)        chocolateTRUE           fruityTRUE 
##           34.7283883           19.5438496            9.1570800 
##          caramelTRUE   peanutyalmondyTRUE           nougatTRUE 
##            2.2049288            9.9400899            0.7589168 
## crispedricewaferTRUE             hardTRUE              barTRUE 
##            8.7610339           -6.0805383            0.5256453 
##         pluribusTRUE         sugarpercent         pricepercent 
##           -0.8447881            9.1226100           -5.7884718

Model Evaluations using ELPD

Once the models are calculated, we calcuate Leave One Out expected log point predictive densities (ELPD LOO) for each model.

Calculate

"Model 1 Loo"
## [1] "Model 1 Loo"
(loo_1 <- loo(model_1))$estimates
##             Estimate         SE
## elpd_loo -349.299296  5.8136277
## p_loo       2.708551  0.5166705
## looic     698.598592 11.6272553
"Model 2 Loo"
## [1] "Model 2 Loo"
(loo_2 <- loo(model_2))$estimates
##             Estimate         SE
## elpd_loo -346.734615  6.6778576
## p_loo       4.051253  0.9045871
## looic     693.469229 13.3557153
"Model 3 Loo"
## [1] "Model 3 Loo"
(loo_3 <- loo(model_3))$estimates
##            Estimate        SE
## elpd_loo -329.91156  5.818139
## p_loo      10.57069  1.541516
## looic     659.82312 11.636278
"Model 4 Loo"
## [1] "Model 4 Loo"
(loo_4 <- loo(model_4))$estimates
##            Estimate        SE
## elpd_loo -330.25141  5.930144
## p_loo      12.93009  1.862167
## looic     660.50283 11.860288
lpd_point <- cbind(
  loo_1$pointwise[,"elpd_loo"], 
  loo_2$pointwise[,"elpd_loo"],
  loo_3$pointwise[,"elpd_loo"],
  loo_4$pointwise[,"elpd_loo"]
)

With the ELPD calculations, the models are able to be ranked in order of their contribution, as a percentage. As shown, model3 has the highest weight with 0.534, followed by model4. By this metric, model3 appears to be the best model from the set of 4 we created above.

(weights <- pseudobma_weights(lpd_point))
## Method: pseudo-BMA+ with Bayesian bootstrap
## ------
##        weight
## model1 0.001 
## model2 0.006 
## model3 0.552 
## model4 0.441

Predictions with Pseudo-BMA

A new type of candy is created (new_candy), which is turned into a dataframe for prediction.

new_candy <- data.frame(chocolate = TRUE, fruity = TRUE, caramel = TRUE, peanutyalmondy = TRUE, nougat = TRUE, crispedricewafer = TRUE, hard = FALSE, bar = FALSE, 
                        pluribus = FALSE, sugarpercent = 0.20, pricepercent = 0.9)
my_predict <- function(model) {
  posterior_predict(model, newdata = new_candy)
}
make_df <- function(predictions, index) {
  data.frame(winpercent_new = predictions[,1], model = index)
}
predictions <- map(model_list, my_predict) %>% 
  map2(1:4, make_df)

Using the weights we calculated from using the pseudobma_weights() function to obtain ELPD scores, we can then generate sample predictions and visualize them, like so:

sampled_pred <- predictions %>% 
  map2(weights, sample_frac) %>% 
  bind_rows() %>% 
  mutate(model = as.character(model))
sampled_pred %>% 
  ggplot(aes(x = winpercent_new, fill = model, color = model)) +
  geom_density_ridges(alpha = 0.2, aes(y = model)) +
  theme_minimal()
## Picking joint bandwidth of 3.32

Unsurprisingly, the different models provided slightly different predictions. The \(MAP\) ‘winpercent’ for model, model2 appear to be closer to 45-50%, while the model3, model show a ‘winpercent’ of closer to 80%.

sampled_pred %>% 
  ggplot(aes(x = winpercent_new, fill = model, color = model)) +
  geom_histogram(alpha = 0.9, position = "stack") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As a more visual demonstration of the weights, the contribution of each model is evident in the predictive distribution, with model 3 contributing the largest, followed by model 4.

Conclusion? Final Thoughts?